
@saksham-1304 saksham-1304 commented Dec 25, 2025

Fix: Preserve multimodal content during same-turn handoffs

Fix #2220

Summary

This PR fixes an issue where multimodal content (images, files, audio) uploaded by users was being lost when the agent performed a handoff to another agent in the same turn with nest_handoff_history=True (the default behavior).

Problem: When a user uploaded an image and the first agent handed off to a specialist agent, the target agent lost access to the uploaded image because the nest_handoff_history function converted all content to a plain text summary, discarding the multimodal data.

Solution: Modified nest_handoff_history to:

  1. Extract and preserve multimodal content from user messages during the handoff
  2. Include the preserved multimodal content as a separate user message alongside the text summary
  3. Provide a cleaner summary that indicates the presence of multimodal content without dumping raw JSON
  4. Mark preserved messages to prevent duplication across chained handoffs
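
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: the function name `preserve_multimodal`, the marker value, and the set of part types are assumptions based on the names used in this description. It covers extraction, the separate user message, and the marker; the text summary is left out.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any

# Assumed values; the PR defines its own constants in src/agents/handoffs/history.py.
MULTIMODAL_TYPES = {"input_image", "input_file", "input_audio"}
PRESERVED_MARKER = "__multimodal_preserved__"


def preserve_multimodal(history: list[dict[str, Any]]) -> dict[str, Any] | None:
    """Return a marked user message carrying every multimodal part, or None."""
    parts: list[dict[str, Any]] = []
    for message in history:
        if message.get("role") != "user":
            continue  # only user uploads are preserved
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content has nothing to preserve
        for part in content:
            if isinstance(part, dict) and part.get("type") in MULTIMODAL_TYPES:
                parts.append(deepcopy(part))
    if not parts:
        return None  # text-only history: nothing extra is added
    return {"role": "user", "name": PRESERVED_MARKER, "content": parts}
```

Under these assumptions, the returned message would be placed next to the text summary, so the target agent still receives the original image, file, or audio parts.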

Changes

Core Fix

  • src/agents/handoffs/history.py:
    • Added _MULTIMODAL_CONTENT_TYPES constant to identify multimodal types
    • Added _PRESERVED_MULTIMODAL_MARKER constant to mark preserved messages and prevent duplication
    • Modified default_handoff_history_mapper() to extract, preserve, and carry forward multimodal content
    • Added _extract_multimodal_content() function to scan user messages for images, files, and audio
    • Added _collect_preserved_multimodal_content() function to collect already-preserved content from previous handoffs
    • Added _stringify_content_list() function to provide human-readable summaries of multimodal content

Tests

  • tests/test_extension_filters.py:
    • Added 7 comprehensive tests covering:
      • Image content preservation
      • File content preservation
      • Multiple images across messages
      • Audio content preservation
      • Text-only messages (backward compatibility)
      • Multimodal content in non-user messages (negative case)
      • Chained handoffs (no duplication across A → B → C handoffs)

Testing

✅ All 20 existing extension filter tests pass
✅ All 14 handoff tool tests pass
✅ All 17 run step processing tests pass
✅ All 7 new multimodal preservation tests pass (including chained handoffs test)
✅ Code passes ruff format and ruff check
✅ Code passes mypy --ignore-missing-imports --no-warn-unused-ignores

Test Plan:

  1. Run: pytest tests/test_extension_filters.py -v
  2. Run: pytest tests/test_handoff_tool.py tests/test_run_step_processing.py -v
  3. Run: ruff format && ruff check
  4. Run: mypy src/agents/handoffs/history.py tests/test_extension_filters.py --ignore-missing-imports

Impact

Backward Compatible: Yes. Text-only conversations work exactly as before. Multimodal content is now additionally preserved when present.

Example Usage:

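# Before: the image was lost during the handoff.
# After: the image is preserved alongside the conversation summary.
# welcome_agent, specialist_agent, and HandoffFilters are assumed to be defined by the application.
from agents import handoff

welcome_agent.handoffs = [
    # nest_handoff_history=True by default; the nested history now preserves multimodal content.
    handoff(specialist_agent, input_filter=HandoffFilters.welcome_to_specialist),
]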

Related Issue

Resolves #2220, which reported that multimodal content was lost during same-turn handoffs with nest_handoff_history=True.

Checklist

  • New tests added for multimodal content preservation
  • All existing tests pass
  • Code formatted with ruff format
  • Code passes linting with ruff check
  • Code passes type checking with mypy
  • Changes are backward compatible
  • Commit messages are concise and in imperative mood
  • Changes follow SDK guidelines for style and documentation

…story

When nest_handoff_history=True (the default), multimodal content (images,
files, audio) from user messages was being lost during handoffs because
the content was converted to a plain text summary.

This fix:
- Extracts multimodal content from user messages before summarization
- Adds the multimodal content as a separate user message after the summary
- Improves the text summary to show '[N image(s) attached]' instead of raw JSON
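
As an illustration of the summary change, the sketch below counts attachments instead of dumping their JSON. The function name `summarize_content` and the labels for files and audio are assumptions; only the '[N image(s) attached]' wording comes from this commit message. Unknown part types are simply ignored here to keep the example short.

```python
from typing import Any

# Labels other than "image(s)" are assumed for illustration.
_LABELS = {
    "input_image": "image(s)",
    "input_file": "file(s)",
    "input_audio": "audio clip(s)",
}


def summarize_content(content: list[Any]) -> str:
    """Render a content list as text, counting attachments instead of dumping JSON."""
    texts: list[str] = []
    counts = {part_type: 0 for part_type in _LABELS}
    for part in content:
        if not isinstance(part, dict):
            texts.append(str(part))
            continue
        part_type = part.get("type")
        if part_type == "input_text":
            if part.get("text"):
                texts.append(part["text"])
        elif part_type in counts:
            counts[part_type] += 1
    for part_type, count in counts.items():
        if count:
            texts.append(f"[{count} {_LABELS[part_type]} attached]")
    return " ".join(texts)


example = [
    {"type": "input_text", "text": "Look"},
    {"type": "input_image", "image_url": "https://example.com/a.png"},
    {"type": "input_image", "image_url": "https://example.com/b.png"},
]
print(summarize_content(example))  # prints: Look [2 image(s) attached]
```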

Fixes issue where target agents lose access to uploaded images during
same-turn handoffs.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Add marker system to track preserved multimodal content:
- Add _PRESERVED_MULTIMODAL_MARKER constant to mark preserved messages
- Skip already-preserved messages during extraction
- Add _collect_preserved_multimodal_content() to carry forward existing
  preserved content across chained handoffs
- Add test for chained handoffs scenario

This addresses the P1 code review feedback about duplicate conversation
turns across chained handoffs.
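
A rough sketch of how such a marker could separate new uploads from content carried over from an earlier handoff. The helper names `extract_new` and `collect_preserved`, the marker value, and the part types are assumptions for illustration; the PR's own helpers are `_extract_multimodal_content()` and `_collect_preserved_multimodal_content()`.

```python
from copy import deepcopy
from typing import Any

MARKER = "__multimodal_preserved__"  # assumed marker value
MULTIMODAL_TYPES = {"input_image", "input_file", "input_audio"}  # assumed


def _media_parts(message: dict[str, Any]) -> list[dict[str, Any]]:
    """Copy the multimodal parts of one message (empty for string content)."""
    content = message.get("content")
    if not isinstance(content, list):
        return []
    return [
        deepcopy(part)
        for part in content
        if isinstance(part, dict) and part.get("type") in MULTIMODAL_TYPES
    ]


def extract_new(transcript: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Multimodal parts from user messages that are NOT already marked."""
    parts: list[dict[str, Any]] = []
    for message in transcript:
        if message.get("role") == "user" and message.get("name") != MARKER:
            parts.extend(_media_parts(message))
    return parts


def collect_preserved(transcript: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Multimodal parts carried over by an earlier handoff (marked messages)."""
    parts: list[dict[str, Any]] = []
    for message in transcript:
        if message.get("role") == "user" and message.get("name") == MARKER:
            parts.extend(_media_parts(message))
    return parts
```

On a second handoff (A → B → C), the combined list would be `collect_preserved(transcript) + extract_new(transcript)`, so each upload appears once: carried-over content first, then anything uploaded since.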
@saksham-1304
Author

@seratch Please merge the PR


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4223ce12f5


Comment on lines +125 to +129
if all_multimodal_content:
    user_message: dict[str, Any] = {
        "role": "user",
        "name": _PRESERVED_MULTIMODAL_MARKER,
        "content": all_multimodal_content,


P3 Badge Exclude preserved multimodal messages from summaries

Because preserved multimodal content is appended as a separate user message, _build_summary_message() will include it in the summary. On the next handoff, _flatten_nested_history_messages() expands that summary back into the transcript while the preserved message is still present in input_history, so chained handoffs (3+ in a row) will accumulate duplicate user (__multimodal_preserved__) lines in the summary. This inflates the summary and can confuse downstream agents; consider filtering _PRESERVED_MULTIMODAL_MARKER messages out of the summary or excluding them when flattening.
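
One way to apply the filtering suggested here is to drop marked messages before the summary lines are built. The sketch below is an assumption-laden illustration: `summary_lines`, the marker value, and the "N. role: content" line format are hypothetical and do not reproduce `_build_summary_message()`.

```python
from typing import Any

MARKER = "__multimodal_preserved__"  # assumed marker value


def summary_lines(transcript: list[dict[str, Any]]) -> list[str]:
    """Build numbered summary lines, skipping preserved multimodal messages."""
    # Marked messages are an implementation detail and travel separately,
    # so they are excluded from the human-readable summary.
    visible = [message for message in transcript if message.get("name") != MARKER]
    return [
        f"{index}. {message.get('role', 'unknown')}: {message.get('content')}"
        for index, message in enumerate(visible, start=1)
    ]
```

Because the marked message never enters the summary, flattening that summary on the next handoff cannot reintroduce it, which is the accumulation this comment describes.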


Copilot AI review requested due to automatic review settings January 10, 2026 07:14

Copilot AI left a comment


Pull request overview

This PR fixes an issue where multimodal content (images, files, audio) was being lost during agent handoffs when using nest_handoff_history=True (the default behavior). The solution preserves multimodal content from user messages by extracting it and including it as a separate user message alongside the conversation summary.

Changes:

  • Modified default_handoff_history_mapper() to extract and preserve multimodal content from user messages during handoffs
  • Added helper functions to extract, collect, and format multimodal content with deduplication support for chained handoffs
  • Enhanced _stringify_content_list() to provide human-readable summaries indicating presence of images, files, and audio

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/agents/handoffs/history.py | Core implementation: added multimodal content preservation logic including extraction, collection, and formatting functions; uses marker to prevent duplication in chained handoffs |
| tests/test_extension_filters.py | Added 7 comprehensive tests covering image, file, and audio preservation; backward compatibility; and chained handoff deduplication |


Comment on lines +110 to 119
# Extract NEW multimodal content from user messages (excludes already-preserved content).
new_multimodal_content = _extract_multimodal_content(transcript)

# Also collect any already-preserved multimodal content from previous handoffs.
existing_multimodal_content = _collect_preserved_multimodal_content(transcript)

# Combine new and existing multimodal content.
all_multimodal_content = existing_multimodal_content + new_multimodal_content

summary_message = _build_summary_message(transcript)

Copilot AI Jan 10, 2026


The preserved multimodal message (marked with _PRESERVED_MULTIMODAL_MARKER) will be included in the summary text when building the transcript summary. This means the summary will contain lines like "user (__multimodal_preserved__): [2 image(s) attached]", which exposes the internal marker and creates confusing output. Consider filtering out messages with name == _PRESERVED_MULTIMODAL_MARKER when building the summary in _build_summary_message, so that this implementation detail does not appear in the user-visible summary.

Copilot uses AI. Check for mistakes.
if isinstance(content, list):
    for part in content:
        if isinstance(part, dict):
            preserved_parts.append(deepcopy(part))

Copilot AI Jan 10, 2026


The function appends all dict items from preserved content without validating they are actually multimodal content types. Consider adding a check that part.get("type") is in _MULTIMODAL_CONTENT_TYPES before appending to maintain consistency with _extract_multimodal_content and ensure only valid multimodal content is carried forward.

Suggested change
- preserved_parts.append(deepcopy(part))
+ part_type = part.get("type")
+ if part_type in _MULTIMODAL_CONTENT_TYPES:
+     preserved_parts.append(deepcopy(part))

Comment on lines +220 to +235
if part_type == "input_text":
    text = part.get("text", "")
    if text:
        parts.append(text)
elif part_type == "input_image":
    image_count += 1
elif part_type == "input_file":
    file_count += 1
elif part_type == "input_audio":
    audio_count += 1
else:
    # Unknown type, try to stringify it.
    try:
        parts.append(json.dumps(part, ensure_ascii=False, default=str))
    except TypeError:
        parts.append(str(part))

Copilot AI Jan 10, 2026


The function only explicitly handles "input_text" for extracting text content from multimodal messages. Assistant messages may contain "output_text" type parts, which will fall through to the JSON stringification fallback. Consider adding explicit handling for "output_text" similar to "input_text" (extracting the "text" field) to make the behavior more predictable and consistent.
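
A small sketch of what this suggestion could look like, assuming "output_text" parts carry their text in a "text" field just as "input_text" parts do. The name `extract_text` is hypothetical and the function is deliberately narrower than `_stringify_content_list()`.

```python
from typing import Any

# Both part types are treated the same way: read the "text" field.
TEXT_PART_TYPES = ("input_text", "output_text")


def extract_text(content: list[Any]) -> str:
    """Join the text of input_text and output_text parts, skipping everything else."""
    texts: list[str] = []
    for part in content:
        if isinstance(part, dict) and part.get("type") in TEXT_PART_TYPES:
            text = part.get("text", "")
            if text:
                texts.append(text)
    return " ".join(texts)
```

With this handling, assistant text would appear in the summary as plain text instead of falling through to the JSON fallback.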


Development

Successfully merging this pull request may close these issues.

Image (multimodal) content lost during agent handoff in same turn

2 participants